This document is an analysis of data provided by the Huron River Watershed Council as part of the 2017 A2 Data Dive.

Click the “Code” buttons above tables and plots to see the R code that generated them.

Data Preparation

Dam River Flow

We’ll start with pre-processed data measuring water flow below three dams on the Huron River. In order from north to south they are New Hudson, Hamburg, and Wall Street.

New Hudson has only daily readings and stops in 2014, but has no missing data. The other two have measurements in 15-minute intervals and have had some missing data imputed, while other gaps remain.

dam_tbls <- list(
  `New Hudson`  = "../Huron River Watershed Council/Data/new_hudson_preprocessed.csv",
  `Hamburg`     = "../Huron River Watershed Council/Data/hamburg_preprocessed.csv",
  `Wall Street` = "../Huron River Watershed Council/Data/wall_street_preprocessed.csv"
) %>%
  map(
    read_csv,
    skip = 1,
    col_names = c("datetime", "flow"),
    col_types = "Td"
  ) %>%
  map(arrange, datetime)

dam_tbl <- dam_tbls %>%
  bind_rows(.id = "site") %>%
  mutate(site = factor(site, levels = names(dam_tbls)))

dam_tbl %>%
  group_by(site) %>%
  slice(c(1:3, (n() - 2):n())) %>%
  kable()
site datetime flow
New Hudson 1990-01-01 00:00:00 109
New Hudson 1990-01-02 00:00:00 108
New Hudson 1990-01-03 00:00:00 108
New Hudson 2014-09-28 00:00:00 108
New Hudson 2014-09-29 00:00:00 104
New Hudson 2014-09-30 00:00:00 107
Hamburg 1990-01-03 00:00:00 212
Hamburg 1990-01-03 00:15:00 212
Hamburg 1990-01-03 00:30:00 212
Hamburg 2017-11-01 16:45:00 163
Hamburg 2017-11-01 17:00:00 163
Hamburg 2017-11-01 17:15:00 163
Wall Street 1990-01-01 00:00:00 358
Wall Street 1990-01-01 00:15:00 358
Wall Street 1990-01-01 00:30:00 333
Wall Street 2017-11-01 16:45:00 372
Wall Street 2017-11-01 17:00:00 372
Wall Street 2017-11-01 17:15:00 372

We can make sure the time intervals and endpoints match what we expect.

dam_tbl %>%
  group_by(site) %>%
  summarize(
    start    = min(datetime),
    stop     = max(datetime),
    n        = n(),
    avg_flow = mean(flow)
  ) %>%
  kable()
site start stop n avg_flow
New Hudson 1990-01-01 2014-09-30 00:00:00 9039 128.7847
Hamburg 1990-01-03 2017-11-01 17:15:00 871618 252.0428
Wall Street 1990-01-01 2017-11-01 17:15:00 959972 545.3998
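The row counts above imply the spacing, but we can also check interval spacing directly. A minimal sketch in base R (the `check_spacing` helper is mine, not part of the analysis):

```r
# Verify that consecutive readings are evenly spaced (gaps in minutes).
check_spacing <- function(times, expected_mins) {
  gaps <- diff(as.numeric(times)) / 60
  all(gaps == expected_mins)
}

# Five evenly spaced 15-minute readings pass the check
times <- as.POSIXct("1990-01-01", tz = "UTC") + 900 * (0:4)
check_spacing(times, 15)      # TRUE

# Dropping a reading creates a 30-minute gap and fails it
check_spacing(times[-3], 15)  # FALSE
```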

Can we plot it?

dam_tbl %>% 
  ggplot(aes(x = datetime, y = flow, color = site)) +
  facet_wrap(~site, ncol = 1) +
  geom_line()

There’s a pretty big difference in magnitude of flow between the sites, which is expected as the cross-sections of the river are different sizes at each. We can also free the Y axis to see more of the within-site variance of flow.

dam_tbl %>%
  ggplot(aes(x = datetime, y = flow, color = site)) +
  geom_line() +
  facet_wrap(~site, ncol = 1, scales = "free_y")

This obscures any year-to-year similarity. Let’s superimpose years atop one another.

set_year <- function(x, y) {
  year(x) <- y
  x
}

dam_tbl %>%
  mutate(
    year         = year(datetime),
    new_datetime = set_year(datetime, max(year))
  ) %>%
  select(site, new_datetime, year, flow) %>%
  filter(complete.cases(.)) %>% 
  ggplot(aes(x = new_datetime, y = flow, color = site)) +
  facet_wrap(~site, ncol = 1, scales = "free_y") +
  geom_line(aes(group = year, alpha = year)) +
  geom_line(data = . %>% group_by(site, new_datetime) %>% summarize(flow = mean(flow, na.rm = TRUE)), size = 1.5) +
  scale_x_datetime(date_breaks = "1 month", date_labels = "%b") +
  labs(
    x     = "Day of Year",
    y     = bquote('Flow ('*ft^3 / s*')'),
    alpha = "Year",
    title = "Flow by Site and Day-of-Year",
    subtitle = "with average line in bold"
  ) +
  guides(color = FALSE)

We can see the Wall Street data is much less smooth day-to-day than readings at the other sites. This could be due to

  • the flow gauge being much closer to the dam upstream than gauges at the other sites are
  • instrumentation or measurement error
  • issues in the imputation already done on this data
  • operational differences (this is the only automated dam)

Or some mix of multiple causes.
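One way to put a number on the visual difference in smoothness is the mean absolute change between successive readings, scaled by mean flow so sites of different magnitude are comparable. A sketch (the `roughness` helper is hypothetical, not from this analysis):

```r
# Relative roughness: average absolute step between successive readings,
# divided by mean flow so differently sized sites can be compared.
roughness <- function(flow) {
  mean(abs(diff(flow)), na.rm = TRUE) / mean(flow, na.rm = TRUE)
}

smooth <- c(100, 101, 102, 103, 104)
jagged <- c(100, 120, 95, 130, 90)
roughness(smooth)  # ~0.01
roughness(jagged)  # ~0.28
```

Applying this per site (e.g., inside `group_by(site)`) would quantify whether Wall Street really is the jumpiest.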

Missing Data

While some missing values have been imputed for Hamburg and Wall Street, we’ll make remaining missing values explicit.

dam_tbls <- dam_tbls %>%
  modify_at(
    c("Hamburg", "Wall Street"),
    ~ .x %>%
      complete(
        datetime = seq(min(.$datetime), max(.$datetime), by = as.difftime(15, units = "mins")),
        fill     = list(flow = NA_real_)
      )
  )

dam_tbl <- dam_tbls %>%
  bind_rows(.id = "site") %>%
  mutate(site = factor(site, levels = names(dam_tbls)))

dam_tbl %>%
  group_by(site) %>%
  summarize(
    start       = min(datetime),
    stop        = max(datetime),
    n           = n(),
    n_missing   = sum(is.na(flow)),
    pct_missing = n_missing / n,
    avg_flow    = mean(flow, na.rm = TRUE)
  ) %>%
  kable()
site start stop n n_missing pct_missing avg_flow
New Hudson 1990-01-01 2014-09-30 00:00:00 9039 0 0.0000000 128.7847
Hamburg 1990-01-03 2017-11-01 17:15:00 975814 104196 0.1067785 252.0428
Wall Street 1990-01-01 2017-11-01 17:15:00 976006 16034 0.0164282 545.3998

We see a lot of missing data in Hamburg, over 10%!
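Beyond the overall percentage, the shape of the missingness matters: many short gaps are easier to impute than a few long outages. A sketch for summarizing gap lengths with base R’s `rle()` (the `gap_lengths` helper and the toy `flow` vector are mine):

```r
# Lengths of consecutive-NA runs in a vector.
gap_lengths <- function(x) {
  r <- rle(is.na(x))
  r$lengths[r$values]
}

flow <- c(212, 210, NA, NA, NA, 205, NA, 207)
gap_lengths(flow)         # 3 1: one 3-reading gap, one single-reading gap
table(gap_lengths(flow))  # distribution of gap sizes
```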

Rainfall

There’s also a set of rain gauge data from the Barton Pond, which is near the Wall Street gauge.

Rainfall is measured daily, in inches.

rain_tbl <- read_csv(
  "../Huron River Watershed Council/Data/barton_pond_raingauge_data.csv",
  skip = 1,
  col_names = c("date", "rainfall"),
  col_types = "Dd"
)

rain_tbl %>%
  head() %>%
  kable()
date rainfall
2009-11-24 0.00
2009-11-25 0.23
2009-11-26 0.04
2009-11-27 0.01
2009-11-28 0.00
2009-11-29 0.11

And a summary again

rain_tbl %>%
  summarize(
    start     = min(date),
    stop      = max(date),
    n         = n(),
    n_missing = sum(is.na(rainfall)),
    avg_rain  = mean(rainfall)
  ) %>%
  kable()
start stop n n_missing avg_rain
2009-11-24 2017-11-01 2900 0 0.0823

And a plot again

rain_tbl %>%
  ggplot(aes(x = date, y = rainfall)) +
  geom_line() +
  geom_smooth(method = "gam")

Here it is yearly:

rain_tbl %>%
  mutate(
    year     = year(date),
    new_date = set_year(date, max(year))
  ) %>%
  filter(!is.na(new_date)) %>%
  ggplot(aes(x = new_date, y = rainfall)) +
  geom_line(aes(group = year, alpha = year)) +
  scale_x_date(date_breaks = "1 month", date_labels = "%b") +
  labs(
    x = "Day of Year",
    y = "Rainfall (in.)",
    alpha = "Year",
    title = "Year-over-Year Daily Rainfall",
    subtitle = "at Barton Pond"
  )

Very noisy day-to-day; how about monthly totals?

rain_tbl %>%
  group_by(year = year(date), new_date = floor_date(date, unit = "month") %>% set_year(max(year))) %>%
  summarize(rainfall = sum(rainfall)) %>%
  ggplot(aes(x = new_date, y = rainfall)) +
  geom_line(aes(group = year, alpha = year)) +
  geom_line(data = . %>% group_by(new_date) %>% summarize(rainfall = mean(rainfall)), size = 1.5) +
  scale_x_date(date_breaks = "1 month", date_labels = "%b") +
  labs(
    x = "Month",
    y = "Rainfall (in.)",
    alpha = "Year",
    title = "Monthly Rainfall Totals",
    subtitle = "at Barton Pond"
  )

We see an unsurprising trend of higher rainfall in the spring, with less than half as much during winter.

Main Questions

Sudden Fluctuation in Flow

From the instructions (General Instructions.pdf):

Sudden fluctuations are changes in flow by 150% or above within a 12 hour period.
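To make the threshold concrete: a change of 150% means flow reaching 2.5 times the trailing average. A quick sketch of the percent-change formula used below:

```r
# Percent change of a reading relative to a trailing average.
pct_change <- function(flow, past_avg) (flow - past_avg) / past_avg

pct_change(250, 100)  # 1.5 -> a 150% increase: flow is 2.5x the average
pct_change(200, 100)  # 1.0 -> a 100% increase: flow doubled
```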

Since New Hudson is measured in daily intervals, we’ll do this with the other two first.

# 12 hours prior should be 12*4=48 points prior due to 15-minute interval
avg_flow_tbl <- dam_tbl %>%
  filter(site != "New Hudson") %>%
  group_by(site) %>% 
  arrange(datetime) %>%
  mutate(
    lag = lag(flow, 1L),
    past_avg_flow = roll_meanr(lag,       48, na.rm = TRUE),
    past_missing  = roll_sumr(is.na(lag), 48) 
  ) %>%
  ungroup() %>%
  select(-lag) %>%
  filter(!is.na(past_missing)) %>%                              # remove early entries without history
  mutate(pct_change = (flow - past_avg_flow) / past_avg_flow)

avg_flow_tbl %>%
  group_by(site) %>%
  slice(c(1:3, (n() - 2):n())) %>%
  kable()
site datetime flow past_avg_flow past_missing pct_change
Hamburg 1990-01-03 11:45:00 208 211.0638 1 -0.0145161
Hamburg 1990-01-03 12:00:00 210 211.0000 0 -0.0047393
Hamburg 1990-01-03 12:15:00 210 210.9583 0 -0.0045428
Hamburg 2017-11-01 16:45:00 163 162.7292 0 0.0016643
Hamburg 2017-11-01 17:00:00 163 162.7292 0 0.0016643
Hamburg 2017-11-01 17:15:00 163 162.7292 0 0.0016643
Wall Street 1990-01-01 11:45:00 353 345.9362 1 0.0204195
Wall Street 1990-01-01 12:00:00 363 346.0833 0 0.0488803
Wall Street 1990-01-01 12:15:00 363 346.1875 0 0.0485647
Wall Street 2017-11-01 16:45:00 372 380.8542 0 -0.0232482
Wall Street 2017-11-01 17:00:00 372 380.7500 0 -0.0229810
Wall Street 2017-11-01 17:15:00 372 380.6458 0 -0.0227136

How many of our samples are more than 150% or 100% higher than the past 12-hour average?

avg_flow_tbl %>%
  group_by(site) %>%
  summarize(
    n_over_150   = sum(pct_change >= 1.5, na.rm = TRUE),
    pct_over_150 = n_over_150 / n(),
    n_over_100   = sum(pct_change >= 1.0, na.rm = TRUE),
    pct_over_100 = n_over_100 / n()
  ) %>%
  kable()
site n_over_150 pct_over_150 n_over_100 pct_over_100
Hamburg 0 0.0000000 0 0.0000000
Wall Street 1768 0.0018116 4022 0.0041211

So we see there are no cases at the Hamburg site where flow exceeded 200% of the trailing 12-hour average, but there are a few thousand such readings (roughly 0.4% of samples) at the Wall Street site.

New Hudson

We can’t do quite the same calculation for New Hudson since its flow is only available per-day, but we can see which points are above the thresholds when compared to the prior day’s flow. This is a lot simpler to compute, and we also don’t have to sweat missing data.

new_hudson_tbl <- dam_tbl %>%
  filter(site == "New Hudson") %>%
  arrange(datetime) %>%
  mutate(
    past_avg_flow = lag(flow, 1L),
    pct_change = (flow - past_avg_flow) / past_avg_flow
  ) %>%
  filter(!is.na(past_avg_flow))

new_hudson_tbl %>%
  summarize(
    n_over_150   = sum(pct_change >= 1.5),
    pct_over_150 = n_over_150 / n(),
    n_over_100   = sum(pct_change >= 1.0),
    pct_over_100 = n_over_100 / n()
  ) %>%
  kable()
n_over_150 pct_over_150 n_over_100 pct_over_100
5 0.0005532 18 0.0019916

So we see just a few days have a flow over the thresholds compared to the previous day; might as well pull them all.

new_hudson_tbl %>%
  filter(pct_change > 1.0) %>%
  arrange(desc(pct_change)) %>%
  kable()
site datetime flow past_avg_flow pct_change
New Hudson 1990-09-07 151.0 50.0 2.020000
New Hudson 2001-04-14 66.0 24.0 1.750000
New Hudson 2011-07-28 121.0 46.1 1.624729
New Hudson 2007-08-20 79.0 30.9 1.556634
New Hudson 2012-08-10 110.0 43.9 1.505695
New Hudson 1999-11-08 162.0 65.0 1.492308
New Hudson 2005-11-14 104.0 42.9 1.424242
New Hudson 1996-04-21 89.0 38.0 1.342105
New Hudson 2008-09-14 262.0 112.0 1.339286
New Hudson 1998-11-02 138.0 59.0 1.338983
New Hudson 2005-11-01 142.0 63.0 1.253968
New Hudson 2000-04-20 58.0 27.0 1.148148
New Hudson 2007-07-27 55.7 26.0 1.142308
New Hudson 2001-07-30 102.0 48.0 1.125000
New Hudson 2006-09-13 88.9 43.2 1.057870
New Hudson 2004-12-22 128.0 63.2 1.025316
New Hudson 2003-10-31 157.0 78.4 1.002551

Plotting them might help show if this is a solved problem, or if these threshold-breaking increases still happen.

avg_flow_tbl <- bind_rows(new_hudson_tbl, avg_flow_tbl)

avg_flow_tbl %>%
  filter(!is.na(pct_change)) %>%
  ggplot(aes(x = datetime, y = pct_change, color = site)) +
  facet_wrap(~site, ncol = 1, scales = "free_y") +
  geom_line() +
  geom_hline(yintercept = c(1, 1.5), linetype = 2) +
  labs(
    x = "Date",
    y = "Change in Flow",
    title = "Change in River Flow Over Time"
  ) +
  theme(legend.position = "none") +
  scale_y_continuous(labels = scales::percent)

Flow Targets

The USGS sets targets for a minimum level of flow below each dam. How often do we dip below those?

# manually extracted from "Example Numeric Flow Targets.docx"
flow_targets <- read_csv("../Huron River Watershed Council/Data/flow_targets.csv", col_types = "ciiii") %>%
  mutate(site = factor(site, levels = unique(site), labels = levels(dam_tbl$site)))

kable(flow_targets)
site base_flow min_flow spring_low_flow spring_high_flow
New Hudson 58 46 60 284
Hamburg 154 124 155 651
Wall Street 272 218 307 1665

We can count how often we’re below the minimums

dam_tbl %>%
  left_join(flow_targets, by = "site") %>%
  filter(!is.na(flow)) %>%
  group_by(site) %>%
  summarize(
    n            = n(), 
    n_low_flow   = sum(flow < min_flow),
    pct_low_flow = n_low_flow / n 
  ) %>%
  kable()
site n n_low_flow pct_low_flow
New Hudson 9039 783 0.0866246
Hamburg 871618 169229 0.1941550
Wall Street 959972 190867 0.1988256

So we see that all three sites spend some time below the threshold. Hamburg and Wall Street each spend nearly a fifth of their readings under the prescribed minimum flow.

Let’s plot our flows again, with the minimum flow limits superimposed.

temp <- dam_tbl %>%
  mutate(
    year         = year(datetime),
    new_datetime = set_year(datetime, max(year))
  ) %>%
  filter(complete.cases(.))

targets_long <- flow_targets %>%
  gather(limit, flow, -site)

limits <- targets_long %>%
  filter(limit %in% c("min_flow")) %>% 
  mutate(new_datetime = min(temp$new_datetime))

temp %>%
  ggplot(aes(x = new_datetime, y = flow, color = site)) +
  facet_wrap(~site, ncol = 1, scales = "free_y") +
  geom_line(aes(group = year, alpha = year)) +
  geom_line(data = . %>% group_by(site, new_datetime) %>% summarize(flow = mean(flow, na.rm = TRUE)), size = 1.5) +
  geom_hline(data = limits, aes(yintercept = flow), linetype = 2) +
  geom_text(
    data = limits,
    aes(x = new_datetime, y = flow, label = flow),
    vjust = -1, hjust = 1.1, color = "black", size = 3
  ) +
  scale_x_datetime(date_breaks = "1 month", date_labels = "%b") +
  labs(
    x     = "Day of Year",
    y     = bquote('Flow ('*ft^3 / s*')'),
    alpha = "Year",
    title = "Flow by Site And Year",
    subtitle = "with minimum flow guidelines (dotted line)"
  ) +
  guides(color = FALSE)

Spring Storm Season

We can also focus on the spring storm season (April 15 - June 30). What percent of the time are we outside the prescribed limits?

spring_only <- temp %>%
  filter(
    new_datetime >= as.POSIXct(paste0(unique(year(new_datetime)), "-04-15"), tz = "UTC"),
    new_datetime <  as.POSIXct(paste0(unique(year(new_datetime)), "-07-01"), tz = "UTC")
  ) %>%
  left_join(flow_targets, by = "site")

spring_only %>%
  select(site, flow, spring_low_flow, spring_high_flow) %>%
  filter(!is.na(flow)) %>%
  group_by(site) %>%
  summarize(
    n                = n(), 
    n_outside_limits = sum(flow < spring_low_flow | flow > spring_high_flow),
    pct_outside      = n_outside_limits / n 
  ) %>%
  kable()
site n n_outside_limits pct_outside
New Hudson 1925 298 0.1548052
Hamburg 202653 38912 0.1920129
Wall Street 205344 45126 0.2197581

Again, plenty of operation outside the prescribed limits. We can see this visually by plotting flow once again and imposing the spring-specific guidelines on top.

limits <- targets_long %>%
  filter(limit %in% grep("spring", limit, value = TRUE)) %>% 
  mutate(new_datetime = min(spring_only$new_datetime))

spring_only %>%
  ggplot(aes(x = new_datetime, y = flow, color = site)) +
  facet_wrap(~site, ncol = 1, scales = "free_y") +
  geom_line(aes(group = year, alpha = year)) +
  geom_line(data = . %>% group_by(site, new_datetime) %>% summarize(flow = mean(flow, na.rm = TRUE)), size = 1.5) +
  geom_hline(
    data = limits,
    aes(yintercept = flow), 
    linetype = 2
  ) +
  geom_text(
    data = limits,
    aes(x = new_datetime, y = flow, label = flow),
    vjust = -1, hjust = 1.1, color = "black", size = 3
  ) +
  scale_x_datetime(date_breaks = "1 month", date_labels = "%B") +
  labs(
    x     = "Date",
    y     = bquote('Flow ('*ft^3 / s*')'),
    alpha = "Year",
    title = "Spring Flow by Site and Year",
    subtitle = "with flow guidelines (dotted lines)"
  ) +
  guides(color = FALSE)

What to do next